This document explores some options for trained ensemble methods that we could start using for COVID-19 forecasting.
These scores summarize model skill for each combination of base target and spatial scale.
For brevity, we look here at the performance of a subset of the "trained" ensemble variations we have considered. Below are the settings we examine, and the reasons we chose them over the alternatives.
Within these settings, we explore variations in the training set window size (the number of past weeks of forecasts used to estimate ensemble weights).
We also consider three quantile grouping strategies: "per model" weights, where each model receives a single weight shared across all quantile levels; "per quantile" weights, with a separate weight parameter for each combination of model and quantile level; and "3 groups" of quantile levels, with one weight per model within each of three groups of levels: the three lowest, the three highest, and the middle ones.
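As a rough illustration (not the estimation code used for this report), the three strategies differ only in how many free weight parameters they allow and how those parameters expand into a full model-by-quantile weight matrix. The quantile grid and group boundaries below are assumptions for the sketch:

```python
import numpy as np

# Illustrative quantile grid; the hub's actual grids differ by target.
quantile_levels = np.array([0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975])
n_models, n_quantiles = 4, len(quantile_levels)

def expand_weights(params, grouping):
    """Expand free parameters into a (n_models, n_quantiles) weight matrix.

    "per model":    one parameter per model, shared across all quantile levels
    "per quantile": a separate parameter for every (model, quantile) pair
    "3 groups":     one parameter per model within each of three blocks of
                    quantile levels (3 lowest / middle / 3 highest)
    """
    if grouping == "per model":
        w = np.tile(params.reshape(n_models, 1), (1, n_quantiles))
    elif grouping == "per quantile":
        w = params.reshape(n_models, n_quantiles)
    elif grouping == "3 groups":
        blocks = [slice(0, 3), slice(3, n_quantiles - 3),
                  slice(n_quantiles - 3, None)]
        w = np.empty((n_models, n_quantiles))
        for g, s in enumerate(blocks):
            w[:, s] = params.reshape(n_models, 3)[:, [g]]
    else:
        raise ValueError(f"unknown grouping: {grouping}")
    # Convexity: nonnegative weights summing to one at each quantile level
    # (a real estimator would enforce this during optimization).
    w = np.maximum(w, 0.0)
    return w / w.sum(axis=0, keepdims=True)
```

With all free parameters equal, each strategy reduces to the equally weighted mean described next.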
We compare to two "untrained" ensembles: an equally-weighted mean (ew) at each quantile level and a median at each quantile level.
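Both untrained ensembles also operate independently at each quantile level; a minimal sketch with made-up member forecasts:

```python
import numpy as np

# Made-up member forecasts at three quantile levels (0.1, 0.5, 0.9);
# rows are models, columns are quantile levels.
member_q = np.array([
    [5.0, 10.0, 20.0],
    [6.0, 12.0, 25.0],
    [4.0,  9.0, 18.0],
])

ew = member_q.mean(axis=0)           # equally weighted mean per quantile level
med = np.median(member_q, axis=0)    # median per quantile level
```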
We perform estimation either separately for each spatial scale (National, State, and County), or jointly across the State and National levels.
The overall average scores in the tables below are computed across a comparable set of forecasts for all models, determined by the model with the fewest available forecasts (the variation with a training set window of 10 weeks, which requires the most history before it can produce its first forecast). For incident deaths, the relative rankings of the median and the mean ("ew") can change as a few weeks are added to or removed from the evaluation set. The per-week scores plotted further down are computed across a comparable set of forecasts for all models available within each week.
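The subsetting logic is roughly the following (a sketch; the column names are assumptions, not the actual schema behind these tables):

```python
import pandas as pd

def comparable_scores(scores: pd.DataFrame) -> pd.DataFrame:
    """Keep only forecast tasks scored for every model, so overall
    averages are computed over a common set of forecasts."""
    task_cols = ["location", "target_variable", "forecast_week", "horizon"]
    n_models = scores["model"].nunique()
    # Count how many distinct models scored each forecast task.
    counts = scores.groupby(task_cols)["model"].nunique()
    complete = counts[counts == n_models].index
    return scores.set_index(task_cols).loc[complete].reset_index()
```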
National level mean scores across comparable forecasts for all methods:
State level mean scores across comparable forecasts for all methods:
County level mean scores across comparable forecasts for all methods:
National level mean scores across comparable forecasts for all methods:
State level mean scores across comparable forecasts for all methods:
National level mean scores across comparable forecasts for all methods:
State level mean scores across comparable forecasts for all methods:
National level mean scores across comparable forecasts for all methods:
State level mean scores across comparable forecasts for all methods:
The high WIS for the equally weighted mean here is not a bug: one component forecast was extremely high in the upper tail, which shows up in WIS but not in the other metrics.
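For context, a brief reminder of the weighted interval score (following Bracher et al. 2021):

$$
\mathrm{IS}_\alpha(F, y) = (u_\alpha - l_\alpha) + \frac{2}{\alpha}\,(l_\alpha - y)\,\mathbf{1}(y < l_\alpha) + \frac{2}{\alpha}\,(y - u_\alpha)\,\mathbf{1}(y > u_\alpha),
$$

$$
\mathrm{WIS}_{\alpha_{0:K}}(F, y) = \frac{1}{K + 1/2}\left(\frac{1}{2}\,|y - m| + \sum_{k=1}^{K} \frac{\alpha_k}{2}\,\mathrm{IS}_{\alpha_k}(F, y)\right),
$$

where $l_\alpha$ and $u_\alpha$ are the endpoints of the central $(1 - \alpha)$ prediction interval and $m$ is the predictive median. The width term $(u_\alpha - l_\alpha)$ enters every interval score, so a single extreme upper-tail quantile inflates WIS directly even when the median is reasonable, while metrics based only on the point forecast never see the tails.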
In these plots we show results for the mean, median, and the top-performing convex approach within each combination of base target and spatial scale.
For readability, we also drop the score for the unweighted mean ensemble forecast of state level cumulative deaths in the week where that method had very high WIS.
The following interactive figures provide some insight into which models would have received weight under the different ensemble specifications. Each plot shows weights over time, faceted by target variable (inc case and inc death) in columns and by geographic level (state and national) in rows. The versions of the ensemble shown here use the same model weights for every state.
These plots show the forecasts from the "top weighted models" (i.e., models receiving more than 1% weight) in a given week.
Other short-term investigations and refinements to explore:
This section displays heat maps showing score availability by date, target_variable, spatial scale, and model. In each cell, we expect to see a number of scores equal to the number of locations for the given spatial scale times the number of horizons for the given target.
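The check behind these heat maps is roughly the following sketch; the column names and the location/horizon counts are placeholder assumptions, not the hub's exact values:

```python
import pandas as pd

# Placeholder expected counts: n_locations * n_horizons per cell.
expected = {"national": 1 * 4, "state": 50 * 4, "county": 3000 * 4}

def availability(scores: pd.DataFrame) -> pd.DataFrame:
    """Count scores in each (date, target, scale, model) cell and flag
    cells that fall short of the expected count."""
    cells = (scores
             .groupby(["date", "target_variable", "spatial_scale", "model"])
             .size()
             .rename("n_scores")
             .reset_index())
    cells["expected"] = cells["spatial_scale"].map(expected)
    cells["complete"] = cells["n_scores"] == cells["expected"]
    return cells
```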
There are some unexpected differences in forecast availability at the state level across models showing up here; I need to investigate this further.
Here we have subset the forecasts to those that are comparable across all models within each combination of base target and spatial scale. We expect to see the exact same score counts for all models within each plot facet. Average scores computed within a combination of base target and spatial scale will be comparable.
Here we have subset the forecasts to those that are comparable across all models within each combination of base target, spatial scale, and week. We expect to see the exact same score counts within each column of the plot, for all models for which any forecasts are available. Average scores computed within a combination of base target, spatial scale, and forecast week will be comparable.
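A sketch of this per-week variant of the earlier subsetting logic (same assumed column names): comparability is enforced separately within each (target, scale, week) group, across only the models that submitted anything in that group.

```python
import pandas as pd

def comparable_within_week(scores: pd.DataFrame) -> pd.DataFrame:
    """Within each (target, scale, week) group, keep only tasks scored
    by every model that has any forecasts in that group."""
    def keep_complete(g: pd.DataFrame) -> pd.DataFrame:
        n_models = g["model"].nunique()
        counts = g.groupby(["location", "horizon"])["model"].nunique()
        complete = counts[counts == n_models].index
        return g.set_index(["location", "horizon"]).loc[complete].reset_index()

    return (scores
            .groupby(["target_variable", "spatial_scale", "forecast_week"],
                     group_keys=False)
            .apply(keep_complete))
```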